perm filename FORMAN[KI,ALS] blob sn#097064 filedate 1974-04-14 generic text, type T, neo UTF8
00100	The Stanford AI Pitch-Synchronous Fourier-Transform Formant Extractor
00200	
00300	The formant extractor is not a formant tracker  in  the  usual  sense
00400	since a fresh determination of the formant locations is made for each
00500	segment independently. This is thought to be desirable as it  reveals
00600	rapid  changes  in  formant location, particularly in the vicinity of
00700	obstruents  where  the  character  of  the  obstruent  is  frequently
00800	revealed  more by these rapid transitions than by anything else. Only
00900	after this has been done is any attempt made to  reconcile  data  for
01000	adjacent segments, as will be explained later.
01100	
01200	Formant  identification  is  based  on  the use of Fourier transforms
01300	using single pitch period segments where the segment starts and  ends
01400	at  the  zero  crossing  which  precedes  the  maximum  excursion  in
01500	amplitude.
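
The choice of starting point can be sketched as follows. This is only
an illustration in Python, not the original program; the function name
segment_start and the arguments period_start and period_len (taken from
a separate pitch detector) are assumptions.

```python
def segment_start(samples, period_start, period_len):
    """Return the index of the first sample after the zero crossing
    that precedes the largest excursion within one pitch period."""
    end = min(period_start + period_len, len(samples))
    # Sample with the largest absolute amplitude inside this period.
    peak = max(range(period_start, end), key=lambda i: abs(samples[i]))
    # Walk backwards while the samples keep the same sign as the peak.
    i = peak
    while i > 0 and samples[i - 1] * samples[peak] > 0:
        i -= 1
    return i


wave = [0, -1, -2, -1, 1, 3, 5, 2, -1, -2]
start = segment_start(wave, 0, len(wave))
```

The segment for the transform then runs from this index for one
segment length.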
01600	
01700	A study has been made of the effects of the segment  location  within
01800	the  period  and  of  the  effect  of  the segment length. In general
01900	cleaner transforms are produced when the segment length is  something
02000	less  than  the  full period; 80% seems to be a reasonable compromise
02100	between cleanness and unwarranted broadening  of  the  peaks  in  the
02200	spectrum  because  of  insufficient  points  of  data. However, it is
02300	questioned whether this  is  a  reasonable  thing  to  do  since  the
02400	location  of  the  formant  peaks  is affected by the glottal loading
02500	during the latter part of the period and this is, of course, removed.
02600	It  seems  more  reasonable  to  assume that the speaker modifies the
02700	shape of his upper vocal tract to  compensate  for  his  own  peculiar
02800	glottal  loading  effects  since  he  attempts to produce sounds that
02900	match those produced by others and it is highly unlikely that the ear
03000	can  do  anything  to  disambiguate  glottal  coupling effects. It is
03100	observed that this glottal loading  effect  is  more  pronounced  for
03200	pitch  periods  that  happen  to  be longer than the average. For all
03300	appearances it seems that most speakers  delay  the  closing  of  the
03400	glottis  rather  than  lengthening the closed time when they drop the
03500	pitch of their voice. A reasonable thing to do thus seems  to  be  to
03600	use the full period for intervals that are of normal length or shorter
03700	and to restrict the length to the average length for long periods.
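
The length rule just stated amounts to clipping each period at the
average. A minimal sketch, assuming that "normal" means the average
period length and that lengths are counted in samples (the name
segment_length is hypothetical):

```python
def segment_length(period_len, average_len):
    """Full period when it is of average length or shorter,
    otherwise restricted to the average length."""
    return min(period_len, average_len)


periods = [180, 200, 240]            # period lengths in samples
average = sum(periods) // len(periods)
lengths = [segment_length(p, average) for p in periods]
```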
03800	
03900	The location of the formant peaks is also shifted somewhat by  shifts
04000	in  the  starting  point  in  the  period  since windowing attenuates
04100	contributions to the transform from the edge portions of the data but
04200	this effect is small as compared with the increase in ease with which
04300	the peaks can be located for the starting location as mentioned.
04400	
04500	The first operation is to locate the largest proper  peaks  found  in
04600	each  of six regions, these being the usual ranges for the first five
04700	formants and the region below the usual lower  limit  for  the  first
04800	formant. These limits are shifted between male and female voices, but
04900	in general we have not found it necessary  to  adjust  them  for  the
05000	specific  speaker.  A  proper  peak  is  defined as the largest local
05100	maximum in the region that is bounded on both sides  by  points  that
05200	are  of  lesser  amplitude.  If  the five points for the five formant
05300	regions are distinct, that is, no two are assigned the same value, the
05400	points are accepted as they stand, subject to a final medial smoothing
05500	operation which will be described later.
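
The search for proper peaks might look like the sketch below. The
function names and the region limits (given as transform bin indices)
are hypothetical. It is also an assumption that the two bounding
points may lie just outside the region.

```python
def proper_peak(spectrum, lo, hi):
    """Index of the largest local maximum in bins lo..hi that has a
    lower point on each side, or None if there is no such point."""
    best = None
    for k in range(max(lo, 1), min(hi, len(spectrum) - 2) + 1):
        if spectrum[k - 1] < spectrum[k] > spectrum[k + 1]:
            if best is None or spectrum[k] > spectrum[best]:
                best = k
    return best


def initial_picks(spectrum, regions):
    """One candidate per region; regions is a list of (lo, hi) pairs."""
    return [proper_peak(spectrum, lo, hi) for lo, hi in regions]


def has_conflict(picks):
    """True when two regions have been assigned the same bin."""
    return len(set(picks)) < len(picks)


spectrum = [0, 2, 1, 5, 3, 4, 1, 0]
picks = initial_picks(spectrum, [(0, 4), (2, 6)])   # overlapping regions
```

Here both overlapping regions claim bin 3, which is the kind of
conflict that must be resolved afterwards.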
05600	
05700	Since the ranges for the formants overlap, frequent  conflicts  occur
05800	and  these  must  now  be resolved.  This is done starting at the low
05900	frequency end. Somewhat different strategies are used  for  different
06000	possible conflicts.
06100	
06200	Should the first and second formant identifications conflict, then
06300	searches are made for the next  largest  proper  peaks,  to  the  low
06400	frequency  side  extending  the  region  to  zero,  and  to  the high
06500	frequency side to the upper limit of the F2 band. The  amplitudes  of
06600	these two new peaks and their positions with respect to median values
06700	for the F1 and F2 regions are then compared. Actually a decision made
06800	on the basis of amplitude only, allowing a 6 db credit for the higher
06900	frequency peak, seems to make the right  decision  almost  always.  A
07000	study  will  be  made  of  this  matter  when a larger sample of data
07100	becomes available.
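
The amplitude-only decision can be sketched as below. This is one
reading of the rule, not the original code: the disputed peak becomes
F1 if the alternative on the high side, after its 6 db credit, is at
least as strong as the alternative on the low side, and becomes F2
otherwise. All names are hypothetical and the spectrum is assumed to
be in db.

```python
def largest_local_max(spec, lo, hi):
    """Largest local maximum in bins lo..hi, or None."""
    best = None
    for k in range(max(lo, 1), min(hi, len(spec) - 2) + 1):
        if spec[k - 1] < spec[k] > spec[k + 1]:
            if best is None or spec[k] > spec[best]:
                best = k
    return best


def resolve_f1_f2(spec_db, disputed, f2_upper, credit_db=6.0):
    """Return (f1_bin, f2_bin), or None if no alternative peak exists."""
    low = largest_local_max(spec_db, 0, disputed - 1)
    high = largest_local_max(spec_db, disputed + 1, f2_upper)
    if low is None and high is None:
        return None
    if low is None:
        return disputed, high
    if high is None:
        return low, disputed
    # The higher-frequency candidate is given the credit.
    if spec_db[high] + credit_db >= spec_db[low]:
        return disputed, high
    return low, disputed


spec_db = [0, 20, 10, 40, 12, 17, 9, 0]
pair = resolve_f1_f2(spec_db, disputed=3, f2_upper=6)
```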
07200	
07300	Having resolved the conflict between F1 and  F2,  attention  is  then
07400	directed to a possible conflict between F2 and F3 which may have been
07500	introduced by the resolution of the F1 F2 conflict or which may  have
07600	been there initially. If a conflict is newly introduced then a second
07700	look is given to the F1 F2 conflict. Recourse is now had to a
07800	procedure  to  locate  a possible F2 peak that had been obscured by a
07900	dominant F1 peak. The approximate shape of the original F1-F2 peak is
08000	assumed to be parabolic, as determined from three data points: the
08100	point at the maximum and the points nearest the two 3 db down
08200	values. A fresh attempt is made to locate a new peak between the
08300	location of the disputed peak, which is now extracted from the
08400	data, and the location previously found for F3. If such a peak is
08500	found it is assigned to F2 and attention is  shifted  to  a  possible
08600	F3-F4 conflict.
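
A rough sketch of the hidden-peak search follows. Several details are
assumptions and are not stated in the text: the spectrum is taken to
be in db, "extracted" is read as subtracting the fitted parabola bin
by bin, and the new peak is taken as the largest local maximum of the
remainder. All names are hypothetical.

```python
def nearest_3db_point(spec_db, peak, step):
    """Bin nearest the 3 db down value on one side (step = -1 or +1)."""
    target = spec_db[peak] - 3.0
    k = peak + step
    while 0 < k < len(spec_db) - 1 and spec_db[k] > target:
        k += step
    inside = k - step
    if inside != peak and abs(spec_db[inside] - target) < abs(spec_db[k] - target):
        return inside
    return k


def parabola_through(p0, p1, p2):
    """Parabola through three (x, y) points, in Lagrange form."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2

    def f(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return f


def hidden_peak(spec_db, peak, f3_bin):
    """Largest local maximum of the remainder between peak and f3_bin."""
    left = nearest_3db_point(spec_db, peak, -1)
    right = nearest_3db_point(spec_db, peak, +1)
    fit = parabola_through((left, spec_db[left]), (peak, spec_db[peak]),
                           (right, spec_db[right]))
    rest = [v - fit(k) for k, v in enumerate(spec_db)]
    best = None
    for k in range(peak + 1, min(f3_bin, len(rest) - 1)):
        if rest[k - 1] < rest[k] > rest[k + 1]:
            if best is None or rest[k] > rest[best]:
                best = k
    return best


# Synthetic case: a parabolic peak at bin 5 with a 6 db shoulder at bin 8.
spec_db = [40 - 1.5 * (k - 5) ** 2 for k in range(13)]
spec_db[8] += 6
found = hidden_peak(spec_db, peak=5, f3_bin=11)
```

In the synthetic case the shoulder at bin 8 is not a local maximum of
the original data, but it is one after the parabola is removed.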
08700	
08800	Should  an  initial  conflict  be  found  between  F2 and F3, this is
08900	resolved in essentially the same way except that no attempt  is  made
09000	to  find  a  possible  hidden  F3  as  was done for F2. Instead, if a
09100	conflict between F4 and F5 is produced by the resolution of an  F3-F4
09200	conflict  then  this  is  resolved  just  as  if  it  were an initial
09300	conflict.
09400	
09500	
09600	Under  certain circumstances it seems to be impossible to resolve all
09700	conflicts by the procedures just described. When this occurs the
09800	failure to locate a proper peak is signaled by storing a zero for
09900	the formant in question and the program proceeds to the next formant.
10000	On the completion of this first go-around a second look is given to
10100	any zero values, and finally, if still unresolved, the zeros are
10200	replaced by the value found for the formant in question in the
10300	previous time slot.
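
The final replacement of zeros can be sketched as follows. Each time
slot is assumed here to be a list of formant values with zero marking
a failure, and the name fill_unresolved is hypothetical. The second
look at the zeros is not shown.

```python
def fill_unresolved(frames):
    """Replace each zero by the value for the same formant in the
    previous (already filled) time slot."""
    filled = []
    for frame in frames:
        row = list(frame)
        for j, value in enumerate(row):
            if value == 0 and filled:
                row[j] = filled[-1][j]
        filled.append(row)
    return filled


frames = [[500, 1500, 2500], [0, 1600, 0], [520, 0, 2600]]
result = fill_unresolved(frames)
```

A zero in the very first time slot has no predecessor and is left as
it is in this sketch.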
10400	
10500	Having resolved all conflicts in this way, the exact locations
10600	for  peaks  are  refined  by  parabolic  interpolations  based on the
10700	positions of the highest point and its two nearest neighbors.  It  is
10800	doubtful  if  the greater precision which results from this operation
10900	is at all needed, at least in the case of  512  point  transforms  on
11000	20,000 hertz data. Nevertheless, at least 2 bits of added precision
11100	can be obtained, and the greatly improved smoothness of the resulting
11200	formant tracks seems to indicate that a corresponding increase in
11300	accuracy has resulted.
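
The refinement is the standard three-point parabolic interpolation.
A sketch is given below with hypothetical names; the conversion to
hertz assumes the 512 point transform and 20,000 hertz sampling rate
mentioned above, which gives a spacing of 39.0625 hertz between
transform points.

```python
def refine_peak(spec, k):
    """Fractional bin position of the vertex of the parabola through
    the highest point k and its two nearest neighbors."""
    a, b, c = spec[k - 1], spec[k], spec[k + 1]
    denom = a - 2.0 * b + c
    if denom == 0:
        return float(k)          # flat top, nothing to refine
    return k + 0.5 * (a - c) / denom


def bin_to_hz(position, sample_rate=20000.0, n_points=512):
    """Convert a (possibly fractional) bin position to hertz."""
    return position * sample_rate / n_points


# Samples of the parabola 10 - (x - 3.25)**2 at x = 2, 3, 4.
spec = [0.0, 0.0, 8.4375, 9.9375, 9.4375, 0.0]
position = refine_peak(spec, 3)
frequency = bin_to_hz(position)
```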
11400	
11500	The procedures so far described result in very good formant tracks.
11600	However, there are still isolated points which appear to be out of
11700	line. Most of these appear to be situations where a person  would  be
11800	quite unable to make an assured decision. A certain few can be traced
11900	to failures in the pitch period determining  procedure  while  others
12000	are   due  to  more  obscure  reasons.  In  almost  all  cases  these
12100	abnormalities persist for but a single pitch period and they  can  be
12200	corrected by a final process of medial smoothing. This is done in one
12300	direction only: going forward in time, each value for each formant is
12400	replaced   by  the  median  value  of  the  point  in  question,  its
12500	predecessor (as already  corrected)  and  its  successor.  Individual
12600	points  which  lie  between  their  neighbors are not altered by this
12700	procedure. Errant points are  replaced  by  values  for  the  nearest
12800	neighbor.  This  procedure  does  have  the effect of correcting true
12900	extrema but an extremum which persists for but a single pitch period
13000	probably  does not contain much phonetic information and can probably
13100	be ignored. One could make allowances for true  extrema  by  applying
13200	the  medial  smoothing  only  to points that lie more than, say, 2 db
13300	away from their nearest  neighbor.  This  refinement  seems  entirely
13400	unnecessary but it is being kept in reserve.
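
The smoothing pass can be sketched as below for a single formant
track. The name median_smooth is hypothetical, and the two end points
are left untouched since they lack a neighbor on one side. The
optional threshold held in reserve is not shown.

```python
def median_smooth(track):
    """Forward three-point median: each value is replaced by the
    median of its corrected predecessor, itself and its successor."""
    out = list(track)
    for i in range(1, len(out) - 1):
        # out[i - 1] is already corrected; out[i + 1] is still original.
        out[i] = sorted((out[i - 1], out[i], out[i + 1]))[1]
    return out


track = [500, 510, 900, 530, 540]     # one errant value at 900
smoothed = median_smooth(track)
```

The errant value is replaced by the neighbor nearest to it in value,
while points lying between their neighbors are unchanged.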
13500	
13600	The  advantages  of this method of formant extraction over other more
13700	conventional tracking procedures seem to lie  in  the  much  improved
13800	results  in  the  vicinity  of  obstruents where the rapid changes in
13900	formant location can be masked by tracking and where  information  as
14000	to  the  nature  of  the  obstruent  is  contained in this transition
14100	region.